IST 707 HW2

Yash Pasar

This homework assignment uses clustering and decision tree techniques to find conclusive evidence in the disputed-essay mystery between Hamilton and Madison. Steps involved in the process:

1. Data preparation and min-max scaling
2. Clustering analysis (k-means and hierarchical)
3. Decision tree analysis, with GridSearchCV for optimizing parameters
4. Prediction and interpretation

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
import seaborn as sns
from scipy.spatial import distance_matrix
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, cross_validate, ShuffleSplit, LeaveOneOut
from sklearn.tree import export_graphviz
from io import StringIO  # sklearn.externals.six is deprecated; the stdlib version works the same here
from IPython.display import Image  
import pydotplus
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from pydot import graph_from_dot_data
from sklearn import metrics
from sklearn.decomposition import PCA
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
np.random.seed(66)
In [2]:
pd.set_option('display.max_columns', 999)
In [3]:
Essay_Tree = pd.read_csv('/Users/yashpasar/Downloads/Disputed_Essay_data.csv')
print('Columns with null values:', sum(Essay_Tree.isnull().any()))
Columns with null values: 0

Data Preparation

In [4]:
cv_Essay_Tree = Essay_Tree.loc[Essay_Tree['author'].isin(['Hamilton', 'Madison']), :]
test_Essay_Tree = Essay_Tree.loc[Essay_Tree['author'].isin(['dispt','HM']), :]
In [5]:
cv_Essay_Tree.shape, test_Essay_Tree.shape
Out[5]:
((66, 72), (14, 72))

Since clustering is sensitive to the range of the data, it is advisable to scale the features before proceeding.
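Min-max scaling maps each feature onto [0, 1] via (x - min) / (max - min), so no single wide-ranged word rate dominates the distance calculations. A quick sanity check on a toy two-feature matrix (not the essay data) showing that `MinMaxScaler` computes exactly this:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# toy matrix with very different column ranges
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# manual min-max scaling, per column
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# MinMaxScaler does the same thing
scaled = MinMaxScaler().fit_transform(X)

print(np.allclose(manual, scaled))  # → True
```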

In [6]:
min_max_scaler = preprocessing.MinMaxScaler()
features = cv_Essay_Tree.iloc[:, 2:72].values
features = min_max_scaler.fit_transform(features)

label = cv_Essay_Tree.iloc[:, 0].values
labels = [0 if i=='Hamilton' else 1 for i in label]

X_train, X_valid, y_train, y_valid = train_test_split(features, labels, test_size=0.3, random_state=1) # 70% training and 30% validation

Clustering Analysis

In [7]:
kmeans = KMeans(n_clusters=2, n_init=25, max_iter=100, random_state=6)
kmeans.fit(X_train)
Out[7]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=100,
       n_clusters=2, n_init=25, n_jobs=None, precompute_distances='auto',
       random_state=6, tol=0.0001, verbose=0)
In [8]:
train_pred = kmeans.predict(X_train)
print(classification_report(y_train, train_pred, target_names = ['Hamilton', 'Madison']))
              precision    recall  f1-score   support

    Hamilton       0.38      0.17      0.23        36
     Madison       0.00      0.00      0.00        10

    accuracy                           0.13        46
   macro avg       0.19      0.08      0.12        46
weighted avg       0.29      0.13      0.18        46

In [9]:
valid_pred = kmeans.predict(X_valid)
print(classification_report(y_valid, valid_pred, target_names = ['Hamilton', 'Madison']))
              precision    recall  f1-score   support

    Hamilton       0.29      0.13      0.18        15
     Madison       0.00      0.00      0.00         5

    accuracy                           0.10        20
   macro avg       0.14      0.07      0.09        20
weighted avg       0.21      0.10      0.14        20
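The low scores above are partly an artifact of how k-means numbers its clusters: cluster 0 is not guaranteed to mean Hamilton, so a raw `classification_report` can look far worse than the clustering actually is. A minimal sketch (on toy labels, not the essay data) of remapping each cluster to the majority true label before scoring:

```python
import numpy as np

def align_clusters(y_true, cluster_ids):
    """Map each cluster id to the majority true label inside that cluster."""
    y_true = np.asarray(y_true)
    cluster_ids = np.asarray(cluster_ids)
    mapping = {}
    for c in np.unique(cluster_ids):
        members = y_true[cluster_ids == c]
        # majority vote: the most frequent true label in this cluster
        mapping[c] = np.bincount(members).argmax()
    return np.array([mapping[c] for c in cluster_ids])

# toy example: cluster ids are flipped relative to the true labels
y_true = [0, 0, 0, 1, 1]
clusters = [1, 1, 1, 0, 0]
print(align_clusters(y_true, clusters))  # → [0 0 0 1 1]
```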

In [10]:
from scipy.spatial.distance import cdist
distortions = []
inertias = []
K = range(1, 10)
for k in K:
    kmeanModel = KMeans(n_clusters=k).fit(X_train)
    distortions.append(sum(np.min(cdist(X_train, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / X_train.shape[0])
    inertias.append(kmeanModel.inertia_) 
    
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method to Identify Optimal k')
plt.show()
In [11]:
distortions
Out[11]:
[1.8131019840987435,
 1.7462974631061567,
 1.7068665782857384,
 1.6504551067224864,
 1.6344127978478273,
 1.6029354897749573,
 1.5659142281797573,
 1.5106637751323921,
 1.473101103073943]
In [12]:
plt.plot(K, inertias, 'bx-') 
plt.xlabel('Values of K') 
plt.ylabel('Inertia') 
plt.title('The Elbow Method using Inertia') 
plt.show() 
In [13]:
inertias
Out[13]:
[153.9515173363794,
 143.52918779601163,
 137.06052821913948,
 130.7364059633394,
 125.22804225570164,
 121.22265059927945,
 115.65873822108593,
 110.27274201385984,
 106.82942918288651]

Since the task is to decide whether the disputed essays belong to Hamilton or Madison, the number of clusters is fixed at 2 by the problem itself; the distortion and inertia plots above are consistent with that choice.
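Another common check alongside distortion and inertia is the silhouette score, which is highest when clusters are compact and well separated. A hedged sketch on synthetic two-cluster data (`make_blobs` stands in for the essay features here):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# two well-separated synthetic blobs
X, _ = make_blobs(n_samples=100, centers=[[0, 0], [8, 8]],
                  cluster_std=1.0, random_state=6)

scores = {}
for k in range(2, 6):
    cluster_labels = KMeans(n_clusters=k, n_init=10, random_state=6).fit_predict(X)
    scores[k] = silhouette_score(X, cluster_labels)

best_k = max(scores, key=scores.get)
print(best_k)  # → 2, matching the true number of blobs
```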

In [14]:
#Principal component separation to create 2 dim picture
pca=PCA(n_components=2)
train_PCA = pca.fit_transform(X_train)
train_PCA_1 = train_PCA[:, 0]
train_PCA_2 = train_PCA[:, 1]
train_pred_df = pd.DataFrame({'pc1':train_PCA_1, 'pc2':train_PCA_2, 'Prediction': train_pred  })
train_pred_df.head()
Out[14]:
pc1 pc2 Prediction
0 0.524651 -1.049297 1
1 0.156084 0.539797 0
2 -0.117350 -0.588623 1
3 0.280543 -0.213029 1
4 0.278350 1.162510 0
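A 2-D PCA picture is only as faithful as the variance it retains; `explained_variance_ratio_` reports the share each component captures, and summing it over the two components tells us how trustworthy the scatter plot is. A minimal sketch on synthetic data (standing in for the scaled essay features):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(6)
# 3-D data where most of the variance lies along the first axis
X = rng.randn(200, 3) * np.array([5.0, 1.0, 0.2])

pca = PCA(n_components=2)
pca.fit(X)

retained = pca.explained_variance_ratio_.sum()
print(retained > 0.9)  # → True: the 2-D projection keeps most of the variance here
```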
In [51]:
%matplotlib inline
trace0= go.Scatter(x=train_pred_df[train_pred_df.Prediction == 0]['pc1'],
                   y=train_pred_df[train_pred_df.Prediction == 0]['pc2'],
                   name="Prediction for Cluster 0",
                   mode ="markers",
                   marker =dict(size=10,color="rgba(15,152,152,0.5)",line=dict(width=1,color="rgb(0,0,0)")))

trace1= go.Scatter(x=train_pred_df[train_pred_df.Prediction == 1]['pc1'],
                   y=train_pred_df[train_pred_df.Prediction == 1]['pc2'],
                   name="Prediction for Cluster 1",
                   mode ="markers",
                   marker =dict(size=10,color="rgba(180,18,180,0.5)",line=dict(width=1,color="rgb(0,0,0)")))

fig = go.Figure()
fig.add_trace(trace0)
fig.add_trace(trace1)
fig.show(renderer="notebook")
In [16]:
#Principal component separation to create 2 dim picture
valid_PCA = pca.fit_transform(X_valid)
valid_PCA_1 = valid_PCA[:, 0]
valid_PCA_2 = valid_PCA[:, 1]
valid_pred_df = pd.DataFrame({'pc1':valid_PCA_1, 'pc2':valid_PCA_2, 'Prediction': valid_pred  })
valid_pred_df.head()
Out[16]:
pc1 pc2 Prediction
0 -0.077346 1.355654 1
1 -0.086545 0.282950 0
2 -0.053415 0.068117 1
3 -0.088009 -0.651012 1
4 0.121790 -0.313610 1
In [50]:
%matplotlib inline
trace0= go.Scatter(x=valid_pred_df[valid_pred_df.Prediction == 0]['pc1'],
                   y=valid_pred_df[valid_pred_df.Prediction == 0]['pc2'],
                   name="Prediction for Cluster 0",
                   mode ="markers",
                   marker =dict(size=10,color="rgba(15,152,152,0.5)",line=dict(width=1,color="rgb(0,0,0)")))

trace1= go.Scatter(x=valid_pred_df[valid_pred_df.Prediction == 1]['pc1'],
                   y=valid_pred_df[valid_pred_df.Prediction == 1]['pc2'],
                   name="Prediction for Cluster 1",
                   mode ="markers",
                   marker =dict(size=10,color="rgba(180,18,180,0.5)",line=dict(width=1,color="rgb(0,0,0)")))


fig = go.Figure()
fig.add_trace(trace0)
fig.add_trace(trace1)
fig.show(renderer="notebook")
In [49]:
from scipy.cluster.hierarchy import ward, dendrogram, cut_tree
# Ward linkage on the scaled feature matrix
linkage_matrix = ward(features)

fig, ax = plt.subplots(figsize=(20, 10))
ax.grid(False)
ax.set_title('Cluster Dendrogram', fontsize = 25)
ax = dendrogram(linkage_matrix, orientation='top', labels=label)
plt.xticks(fontsize=15)
plt.show()
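The same Ward linkage that draws the dendrogram can also be cut into a flat two-cluster assignment with `cut_tree` (imported above). A minimal sketch on toy one-dimensional data, not the essay features:

```python
import numpy as np
from scipy.cluster.hierarchy import ward, cut_tree

# two obvious groups on a line
X = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])

linkage_matrix = ward(X)
flat = cut_tree(linkage_matrix, n_clusters=2).ravel()
print(flat)  # the first three points share one cluster, the last three the other
```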
In [19]:
hac = AgglomerativeClustering()
pred = hac.fit_predict(features)
print(classification_report(labels, pred, target_names = ['Hamilton', 'Madison']))
              precision    recall  f1-score   support

    Hamilton       0.98      0.88      0.93        51
     Madison       0.70      0.93      0.80        15

    accuracy                           0.89        66
   macro avg       0.84      0.91      0.86        66
weighted avg       0.92      0.89      0.90        66

Decision Tree Analysis

In [20]:
cv_Essay_Tree.shape, test_Essay_Tree.shape
Out[20]:
((66, 72), (14, 72))
In [21]:
features = cv_Essay_Tree.iloc[:, 2:].values
features = min_max_scaler.fit_transform(features)

label = cv_Essay_Tree.iloc[:, 0].values
labels = [0 if i=='Hamilton' else 1 for i in label]

X_train, X_valid, y_train, y_valid = train_test_split(features, labels, test_size=0.3, random_state=1) # 70% training and 30% validation
In [22]:
clf = DecisionTreeClassifier(random_state = 25)
clf.fit(X_train, y_train)
pred = clf.predict(X_valid)
print(classification_report(y_valid, pred, target_names = ['Hamilton', 'Madison']))
              precision    recall  f1-score   support

    Hamilton       1.00      1.00      1.00        15
     Madison       1.00      1.00      1.00         5

    accuracy                           1.00        20
   macro avg       1.00      1.00      1.00        20
weighted avg       1.00      1.00      1.00        20

In [23]:
clf.tree_.max_depth
Out[23]:
2
In [24]:
print(f"Accuracy: {round(metrics.accuracy_score(y_valid, pred)*100)}%")
Accuracy: 100.0%
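With a fitted tree of depth 2, only a handful of function words can be driving the splits; `feature_importances_` exposes which ones. A hedged sketch on toy data (the word names below are illustrative, not the actual dataset's; in the notebook one would zip `Essay_Tree.columns[2:72]` with `clf.feature_importances_`):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# toy data: only the first feature actually separates the classes
rng = np.random.RandomState(25)
X = rng.rand(100, 4)
y = (X[:, 0] > 0.5).astype(int)

clf = DecisionTreeClassifier(random_state=25).fit(X, y)

feature_names = ['upon', 'there', 'on', 'by']  # illustrative word features
top = sorted(zip(feature_names, clf.feature_importances_),
             key=lambda t: -t[1])
print(top[0][0])  # → 'upon': the informative feature dominates
```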

Next, I will graph the decision tree to get a better visual of what the model is doing: exporting the tree's DOT data, converting it to a graph with pydotplus's graph_from_dot_data method, and displaying the graph with IPython.display's Image.

In [25]:
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True, feature_names = Essay_Tree.columns[2:72])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_png('Essay_Tree.png')
Image(graph.create_png())
Out[25]:

Using GridSearchCV for optimizing parameters

The default, unpruned model makes our disputed-essays problem difficult to understand and interpret, so next I tune its hyperparameters. One interesting observation is that this model predicted nearly all of the disputed essays as Madison.

In [26]:
param_grid = {'criterion': ['gini', 'entropy'],
              'min_samples_split': [2, 10, 20],
              'max_depth': [5, 10, 20, 25, 30],
              'min_samples_leaf': [1, 5, 10],
              'max_leaf_nodes': [2, 5, 10, 20]}
grid = GridSearchCV(clf, param_grid, cv=10, scoring='accuracy')
grid.fit(X_train, y_train)
Out[26]:
GridSearchCV(cv=10, error_score='raise-deprecating',
             estimator=DecisionTreeClassifier(class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort=False, random_state=25,
                                              splitter='best'),
             iid='warn', n_jobs=None,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [5, 10, 20, 25, 30],
                         'max_leaf_nodes': [2, 5, 10, 20],
                         'min_samples_leaf': [1, 5, 10],
                         'min_samples_split': [2, 10, 20]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='accuracy', verbose=0)
In [27]:
print(grid.best_score_)
0.9565217391304348
In [28]:
for hps, values in grid.best_params_.items():
  print(f"{hps}: {values}")
criterion: gini
max_depth: 5
max_leaf_nodes: 2
min_samples_leaf: 1
min_samples_split: 2
In [29]:
clf = DecisionTreeClassifier(random_state = 25, criterion='gini', max_depth = 5, max_leaf_nodes = 2, min_samples_leaf = 1, min_samples_split = 2 )
clf.fit(X_train, y_train)
pred = clf.predict(X_valid)
print(classification_report(y_valid, pred, target_names = ['Hamilton', 'Madison']))
              precision    recall  f1-score   support

    Hamilton       1.00      1.00      1.00        15
     Madison       1.00      1.00      1.00         5

    accuracy                           1.00        20
   macro avg       1.00      1.00      1.00        20
weighted avg       1.00      1.00      1.00        20

In [30]:
print(f"Accuracy: {round(metrics.accuracy_score(y_valid, pred)*100)}%")
Accuracy: 100.0%

Prediction and Interpretation

In [31]:
x_test = test_Essay_Tree.iloc[:, 2:]
x_test = min_max_scaler.fit_transform(x_test)
y_test = test_Essay_Tree.iloc[:, 0].values
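One caveat worth noting in the cell above: `fit_transform` refits the scaler to the test essays' own min/max, so a given word rate can map to a different scaled value than it did in training. A common alternative is to reuse the scaler fitted on the training data; a minimal one-feature sketch of the difference:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[0.0], [10.0]])
X_test = np.array([[5.0], [20.0]])

scaler = MinMaxScaler().fit(X_train)       # fit on training data only
print(scaler.transform(X_test).ravel())    # → [0.5 2. ], the same mapping as training

# refitting on the test set gives a different, inconsistent mapping
print(MinMaxScaler().fit_transform(X_test).ravel())  # → [0. 1.]
```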
In [32]:
fPCA = pca.fit_transform(features)
PCA_1 = fPCA[:, 0]
PCA_2 = fPCA[:, 1]
PCA_df = pd.DataFrame({'pc1':PCA_1, 'pc2':PCA_2, 'label': label  })
PCA_df.head()
Out[32]:
pc1 pc2 label
0 0.000565 -0.566780 Hamilton
1 -0.156943 0.385950 Hamilton
2 0.116603 0.261055 Hamilton
3 -0.495051 -0.518613 Hamilton
4 -0.312605 -0.194928 Hamilton
In [33]:
#Principal component separation to create 2 dim picture
test_PCA = pca.fit_transform(x_test)
test_PCA_1 = list(test_PCA[:, 0])
test_PCA_2 = list(test_PCA[:, 1])

new_PCA_1 = list(PCA_1) + test_PCA_1
new_PCA_2 = list(PCA_2) + test_PCA_2
new_label = list(label) + list(y_test)

test_pred_df = pd.DataFrame({'pc1':new_PCA_1, 'pc2':new_PCA_2, 'Label': new_label})
test_pred_df.Label.unique()
Out[33]:
array(['Hamilton', 'Madison', 'dispt', 'HM'], dtype=object)
In [47]:
%matplotlib inline
trace0= go.Scatter(x=test_pred_df[test_pred_df.Label == 'Hamilton']['pc1'],
                   y=test_pred_df[test_pred_df.Label == 'Hamilton']['pc2'],
                   name="Hamilton's Cluster",
                   mode ="markers",
                   marker =dict(size=10,color="rgba(15,152,152,0.5)",line=dict(width=1,color="rgb(0,0,0)")))

trace1= go.Scatter(x=test_pred_df[test_pred_df.Label == 'Madison']['pc1'],
                   y=test_pred_df[test_pred_df.Label == 'Madison']['pc2'],
                   name="Madison's Cluster",
                   mode ="markers",
                   marker =dict(size=10,color="rgba(180,18,180,0.5)",line=dict(width=1,color="rgb(0,0,0)")))

trace2= go.Scatter(x=test_pred_df[test_pred_df.Label == 'dispt']['pc1'],
                   y=test_pred_df[test_pred_df.Label == 'dispt']['pc2'],
                   name="Disputed's Cluster",
                   mode ="markers",
                   marker =dict(size=10,color="rgba(150,12,17,0.5)",line=dict(width=1,color="rgb(0,0,0)")))

trace3= go.Scatter(x=test_pred_df[test_pred_df.Label == 'HM']['pc1'],
                   y=test_pred_df[test_pred_df.Label == 'HM']['pc2'],
                   name="HM's Cluster",
                   mode ="markers",
                   marker =dict(size=10,color="rgba(252,252,255,0.5)",line=dict(width=1,color="rgb(0,0,0)")))

fig = go.Figure()
fig.add_trace(trace0)
fig.add_trace(trace1)
fig.add_trace(trace2)
fig.add_trace(trace3)
fig.show(renderer="notebook")
In [35]:
test_pred_Essay_Tree_KM = clf.predict(x_test)  # note: this reuses the tuned decision tree, not the k-means model
pred_label_Essay_Tree_KM = ['Hamilton' if i==0 else 'Madison' for i in test_pred_Essay_Tree_KM]
In [36]:
pd.DataFrame({'Author': test_Essay_Tree.iloc[:,0].values, 'Predicted Author using K Means' : pred_label_Essay_Tree_KM})
Out[36]:
Author Predicted Author using K Means
0 dispt Madison
1 dispt Hamilton
2 dispt Madison
3 dispt Madison
4 dispt Madison
5 dispt Madison
6 dispt Madison
7 dispt Madison
8 dispt Madison
9 dispt Madison
10 dispt Madison
11 HM Hamilton
12 HM Madison
13 HM Hamilton
In [37]:
test_pred_Essay_Tree_DT = clf.predict(x_test)
pred_label_Essay_Tree_DT = ['Hamilton' if i==0 else 'Madison' for i in test_pred_Essay_Tree_DT]
In [38]:
pd.DataFrame({'Author': test_Essay_Tree.iloc[:,0].values, 'Predicted Author using Decision Tree' : pred_label_Essay_Tree_DT})
Out[38]:
Author Predicted Author using Decision Tree
0 dispt Madison
1 dispt Hamilton
2 dispt Madison
3 dispt Madison
4 dispt Madison
5 dispt Madison
6 dispt Madison
7 dispt Madison
8 dispt Madison
9 dispt Madison
10 dispt Madison
11 HM Hamilton
12 HM Madison
13 HM Hamilton

1. Our model predicts 10 out of 11 disputed cases to be written by Madison.

2. This means that, in a purely numerical sense, the data points in the disputed essays are much closer to the data points in the Madison essays.

3. This is the final nail in the coffin: all of our analysis up to this point indicated that the essays were written by Madison, and this result further reaffirms that conclusion.

Therefore, Madison is the author of the disputed essays. Mystery solved!

In [ ]: